{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ac596f76",
   "metadata": {},
   "source": [
    "# SplitREC: Using the DeepFM Algorithm in SecretFlow"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50f728e5",
   "metadata": {},
   "source": [
    "> The following code is for demonstration only. For system security reasons, it is NOT intended for production. Please DO NOT use it directly in production."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cc8a13a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9a4b7b2",
   "metadata": {},
   "source": [
    "SecretFlow provides `SLModel` for vertical federated learning. In the vertical setting, the parties hold complementary features and train jointly to obtain a better model. Recommendation is a natural fit for the vertical federated solution and has broad application prospects.\n",
    "Different data holders own different features, such as consumption features, financial features, and user profiles. These features are complementary, but the holders are unwilling to share them with each other. However, recommendation algorithms often cannot be applied to split learning directly. For example, the FM algorithm requires feature crossing.\n",
    "SecretFlow therefore provides an applications module that encapsulates commonly used recommendation algorithms, so that users can apply federated learning to recommendation tasks more easily.\n",
    "This tutorial introduces how to use the DeepFM algorithm in SecretFlow."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5422a5e",
   "metadata": {},
   "source": [
    "## Introduction to the principle of DeepFM splitting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d550c385",
   "metadata": {},
   "source": [
    "The DeepFM algorithm combines the strengths of FM and deep neural networks, so it can learn low-order and high-order feature interactions at the same time. Compared with the Wide & Deep model, DeepFM also removes the need for manual feature engineering.\n",
    "\n",
    "![deepfm_algo](./resources/deepfm_algo.png)\n",
    "\n",
    "Structurally, the model consists of two parts, the FM part and the Deep part. Both parts share the same input, unlike the Wide & Deep model, where the two parts take different inputs. The Deep part learns the high-order interactions among the features, while the FM part computes the second-order (pairwise) cross information between features through the latent vectors $v$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac1b6e4a",
   "metadata": {},
   "source": [
    "## DeepFM On SecretFlow"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6706e7b",
   "metadata": {},
   "source": [
    "Split Scheme"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b76e16df",
   "metadata": {},
   "source": [
    "Derivation of the FM part of DeepFM\n",
    "$$\\hat{y}=w_0 + \\sum_{i=1}^nw_ix_i + \\sum_{i=1}^{n-1}\\sum_{j=i+1}^nw_{ij}x_ix_j$$\n",
    "$$\\hat{y}=w_0 + \\sum_{i=1}^nw_ix_i + \\sum_{i=1}^{n-1}\\sum_{j=i+1}^nv_i^Tv_jx_ix_j$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d453e82b",
   "metadata": {},
   "source": [
    "Mathematical simplification\n",
    "$$\n",
    "\\begin{align}\n",
    "\\sum_{i=1}^{n-1}\\sum_{j=i+1}^nv_i^Tv_jx_ix_j &= \\frac12\\sum_{i=1}^n\\sum_{j=1}^nv_i^Tv_jx_ix_j - \n",
    "\\frac12\\sum_{i=1}^nv_i^Tv_ix_ix_i  \\\\\n",
    "&= \\frac12(\\sum_{i=1}^n\\sum_{j=1}^n\\sum_{f=1}^kv_{i,f}v_{j,f}x_ix_j - \\sum_{i=1}^n\\sum_{f=1}^kv_{i,f}v_{i,f}x_ix_i) \\\\\n",
    "&= \\frac12\\sum_{f=1}^k[(\\sum_{i=1}^nv_{i,f}x_i)(\\sum_{j=1}^nv_{j,f}x_j) - (\\sum_{i=1}^nv_{i,f}^2x_i^2)]\n",
    "\\\\ \n",
    "&= \\frac12\\sum_{f=1}^k[(\\sum_{i=1}^nv_{i,f}x_i)^2 - \\sum_{i=1}^nv_{i,f}^2x_i^2]\n",
    "\\end{align} \n",
    "$$"
   ]
  },
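  {
   "cell_type": "markdown",
   "id": "e1a7c0d2",
   "metadata": {},
   "source": [
    "As a quick sanity check of the simplification, consider a small made-up case with $n=3$ features and $k=1$: $v=(1,2,3)$ and $x=(1,1,2)$, so that $v_ix_i$ takes the values $1, 2, 6$. The numbers are chosen only for illustration.\n",
    "$$\n",
    "\\begin{align}\n",
    "\\sum_{i=1}^{2}\\sum_{j=i+1}^{3}v_iv_jx_ix_j &= 1\\cdot2 + 1\\cdot6 + 2\\cdot6 = 20 \\\\\n",
    "\\frac12[(\\sum_{i=1}^{3}v_ix_i)^2 - \\sum_{i=1}^{3}v_i^2x_i^2] &= \\frac12[(1+2+6)^2 - (1+4+36)] = \\frac12(81-41) = 20\n",
    "\\end{align}\n",
    "$$\n",
    "Both sides agree, and the second form only needs the per-feature quantities $v_ix_i$ and $v_i^2x_i^2$."
   ]
  },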
  {
   "cell_type": "markdown",
   "id": "423d2cc0",
   "metadata": {},
   "source": [
    "We end up with the formula\n",
    "$$\n",
    "\\hat{y} = w_0 + \\sum_{i=1}^nw_ix_i + \\frac12\\sum_{f=1}^k[(\\sum_{i=1}^nv_{i,f}x_i)^2-\\sum_{i=1}^nv_{i,f}^2x_i^2]\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78d9c7f4",
   "metadata": {},
   "source": [
    "## DeepFM split version"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b3e961c",
   "metadata": {},
   "source": [
    "The derivation above shows that the pairwise term $v_i^Tv_j$ no longer appears in the FM calculation. Every remaining sum runs over single features:\n",
    "$$\n",
    "\\hat{y} = w_0 + \\sum_{i=1}^nw_ix_i + \\frac12\\sum_{f=1}^k[(\\sum_{i=1}^nv_{i,f}x_i)^2-\\sum_{i=1}^nv_{i,f}^2x_i^2]\n",
    "$$ \n",
    "So it is enough for each party to compute $\\sum_{i}v_{i,f}x_i$ and $\\sum_{i}v_{i,f}^2x_i^2$ over its own features. Distributing the sum over $f$ gives \n",
    "\n",
    "$$\\hat{y}=w_0+\\sum_{i=1}^nw_ix_i +\\frac{1}{2}[\\sum_{f=1}^k(\\sum_{i=1}^nv_{i,f}x_i)^2 - \\sum_{f=1}^k\\sum_{i=1}^nv_{i,f}^2{x_i}^2]$$  \n",
    "Extending this to two parties, assume that alice and bob hold $N_a$ and $N_b$ features respectively. Sums up to $N_a$ run over alice's features and parameters, and sums up to $N_b$ run over bob's. Then\n",
    "\n",
    "$$\\hat{y}=w_{a0}+w_{b0}+\\sum_{i=1}^{N_a}w_ix_i + \\sum_{i=1}^{N_b}w_ix_i +\\frac{1}{2}[\\sum_{f=1}^k(\\sum_{i=1}^{N_a}v_{i,f}x_i+\\sum_{i=1}^{N_b}v_{i,f}x_i)^2 - (\\sum_{f=1}^k\\sum_{i=1}^{N_a}v_{i,f}^2{x_i}^2+\\sum_{f=1}^k\\sum_{i=1}^{N_b}v_{i,f}^2{x_i}^2)]\n",
    "$$\n",
    "\n",
    "This formula shows that each participant, alice for example, only needs to compute \n",
    "$\\sum_{i=1}^{N_a}v_{i,f}x_i$ and $\\sum_{i=1}^{N_a}v_{i,f}^2{x_i}^2$ \n",
    "locally, and then transmit the two parts (1) and (2) to the fusenet, which completes the FM part of the calculation without any loss \n",
    "$$w_{a0}+\\sum_{i=1}^{N_a}w_ix_i-\\frac12\\sum_{f=1}^k\\sum_{i=1}^{N_a}v_{i,f}^2{x_i}^2 \\tag 1$$\n",
    "$$\\sum_{i=1}^{N_a}v_{i,f}x_i \\tag 2$$  \n",
    "The fusenet adds up the parts (1) of all parties, adds up the parts (2) of all parties, and then adds $\\frac12\\sum_{f=1}^k(\\cdot)^2$ of the summed part (2) to the summed part (1).  \n",
    "\n",
    "The original embedded $v$ vectors have dimension `batchsize*Na*k` for alice and `batchsize*Nb*k` for bob.      \n",
    "The data actually transmitted after the derivation:       \n",
    "- The dimension of (1) is `batchsize*1`, which does not leak the information of the $v$ matrix        \n",
    "- The dimension of (2) is `batchsize*k`, so the information on the feature dimension has been eliminated, and only the sum over features along the FM vector dimension is retained (usually dim=4)       "
   ]
  },
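  {
   "cell_type": "markdown",
   "id": "b4d92f10",
   "metadata": {},
   "source": [
    "The two-party formula can be checked numerically. The following is a minimal NumPy sketch, not the SecretFlow implementation: the sizes, the random data, and the helper names `local_parts`, `fuse` and `fm_reference` are all assumptions made for illustration. Part (1) is taken here to include the factor $-\\frac12$ and the sum over $f$, so that it is a single scalar per sample.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "batch, n_a, n_b, k = 4, 3, 2, 4  # illustrative sizes, not taken from the tutorial\n",
    "\n",
    "# Hypothetical local features and FM parameters of alice (a) and bob (b)\n",
    "x_a, x_b = rng.normal(size=(batch, n_a)), rng.normal(size=(batch, n_b))\n",
    "v_a, v_b = rng.normal(size=(n_a, k)), rng.normal(size=(n_b, k))\n",
    "w_a, w_b = rng.normal(size=n_a), rng.normal(size=n_b)\n",
    "w_a0, w_b0 = 0.1, -0.2\n",
    "\n",
    "\n",
    "def local_parts(x, v, w, w0):\n",
    "    # Part (1): one scalar per sample, shape (batch,)\n",
    "    part1 = w0 + x @ w - 0.5 * ((x**2) @ (v**2)).sum(axis=1)\n",
    "    # Part (2): sum over local features of v_{i,f} * x_i, shape (batch, k)\n",
    "    part2 = x @ v\n",
    "    return part1, part2\n",
    "\n",
    "\n",
    "def fuse(parts):\n",
    "    # Add up both parts over all parties, then square and sum part (2) over f\n",
    "    scalar = sum(p1 for p1, _ in parts)\n",
    "    vec = sum(p2 for _, p2 in parts)\n",
    "    return scalar + 0.5 * (vec**2).sum(axis=1)\n",
    "\n",
    "\n",
    "def fm_reference(x, v, w, w0):\n",
    "    # Plain FM with the explicit pairwise sum over all features\n",
    "    y = w0 + x @ w\n",
    "    n = x.shape[1]\n",
    "    for i in range(n - 1):\n",
    "        for j in range(i + 1, n):\n",
    "            y = y + (v[i] @ v[j]) * x[:, i] * x[:, j]\n",
    "    return y\n",
    "\n",
    "\n",
    "y_split = fuse([local_parts(x_a, v_a, w_a, w_a0), local_parts(x_b, v_b, w_b, w_b0)])\n",
    "y_full = fm_reference(\n",
    "    np.hstack([x_a, x_b]),\n",
    "    np.vstack([v_a, v_b]),\n",
    "    np.concatenate([w_a, w_b]),\n",
    "    w_a0 + w_b0,\n",
    ")\n",
    "print(np.allclose(y_split, y_full))\n",
    "```\n",
    "\n",
    "In this sketch each party only hands over an array of shape `(batch,)` and one of shape `(batch, k)`, and the fused result matches the FM output computed on the concatenated features."
   ]
  },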
  {
   "cell_type": "markdown",
   "id": "750e0bf7",
   "metadata": {},
   "source": [
    "  ![deepfm_path](./resources/deepfm_plan.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "674b61fb",
   "metadata": {},
   "source": [
    "## SecretFlow encapsulation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6eb7aa9d",
   "metadata": {},
   "source": [
    "SecretFlow provides encapsulations of various algorithms in its applications module.\n",
    "The DeepFM encapsulation is in secretflow_fl/ml/nn/applications/sl_deep_fm.py, which provides two modules, `DeepFMbase` and `DeepFMfuse`.  \n",
    "**DeepFMbase**  \n",
    "```python\n",
    "class DeepFMbase(tf.keras.Model):\n",
    "    def __init__(\n",
    "        self,\n",
    "        dnn_units_size,\n",
    "        dnn_activation=\"relu\",\n",
    "        preprocess_layer=None,\n",
    "        fm_embedding_dim=16,\n",
    "        **kwargs,\n",
    "    ):\n",
    "        \"\"\"Split learning version of DeepFM\n",
    "        Args:\n",
    "            dnn_units_size: list,list of positive integer or empty list, the layer number and units in each layer of DNN\n",
    "            dnn_activation: activation function of dnn part\n",
    "            preprocess_layer: The preprocessed layer a keras model, output a dict of preprocessed data\n",
    "            fm_embedding_dim: fm embedding dim, default to be 16\n",
    "        \"\"\"\n",
    "```\n",
    "**DeepFMfuse**\n",
    "```python\n",
    "class DeepFMfuse(tf.keras.Model):\n",
    "    def __init__(self, dnn_units_size, dnn_activation=\"relu\", **kwargs):\n",
    "        ...  # body omitted in this excerpt\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e1f0977",
   "metadata": {},
   "source": [
    "**Let's take an example to see how to use DeepFM packaged in SecretFlow for training and prediction**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dcb9b8a",
   "metadata": {},
   "source": [
    "## Environment settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "26eeba56",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The version of SecretFlow: 1.0.0a0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-07-25 11:26:15,683\tINFO worker.py:1538 -- Started a local Ray instance.\n"
     ]
    }
   ],
   "source": [
    "import secretflow as sf\n",
    "\n",
    "# Check the version of your SecretFlow\n",
    "print('The version of SecretFlow: {}'.format(sf.__version__))\n",
    "\n",
    "# In case you have a running secretflow runtime already.\n",
    "sf.shutdown()\n",
    "sf.init(['alice', 'bob', 'charlie'], address=\"local\", log_to_driver=False)\n",
    "alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f14abd26",
   "metadata": {},
   "source": [
    "## Dataset introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08db69c8",
   "metadata": {},
   "source": [
    "We will use the most classic **MovieLens** dataset for demonstration here. \n",
    "**MovieLens** is an open recommendation system dataset that includes movie ratings and movie metadata information.      \n",
    "[Dataset address](https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b46ed47",
   "metadata": {},
   "source": [
    "We split the data:     \n",
    "- alice: \"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"       \n",
    "- bob:   \"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\"      "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "431bccb5",
   "metadata": {},
   "source": [
    "For details about DataBuilder, please see CustomDataLoaderOnSL"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3dd5e16",
   "metadata": {},
   "source": [
    "## Download and process data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79d8f42a",
   "metadata": {},
   "source": [
    "Data splitting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b88c15f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "%%!\n",
    "wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip\n",
    "unzip ./ml-1m.zip "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d8ee8592",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the data in dat format and convert it into a dictionary\n",
    "def load_data(filename, columns):\n",
    "    data = {}\n",
    "    with open(filename, \"r\", encoding=\"unicode_escape\") as f:\n",
    "        for line in f:\n",
    "            ls = line.strip(\"\\n\").split(\"::\")\n",
    "            data[ls[0]] = dict(zip(columns[1:], ls[1:]))\n",
    "    return data"
   ]
  },
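  {
   "cell_type": "markdown",
   "id": "c3f1a9e4",
   "metadata": {},
   "source": [
    "To make the structure returned by `load_data` concrete, the sketch below applies the same parsing steps to a single in-memory line instead of a file. The sample line is an assumption based on the `::`-separated format of `users.dat`, and the variable names are only for illustration.\n",
    "\n",
    "```python\n",
    "columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']\n",
    "line = '1::F::1::10::48067'  # hypothetical sample line\n",
    "\n",
    "# Same steps as load_data: split on '::' and key the record by the first field\n",
    "ls = line.strip().split('::')\n",
    "record = {ls[0]: dict(zip(columns[1:], ls[1:]))}\n",
    "print(record)\n",
    "```\n",
    "\n",
    "Each entry is keyed by the first column and all values stay strings, which is why the later cells look up users and movies by their string IDs."
   ]
  },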
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e7836104",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_csv = {alice: \"alice_ml1m.csv\", bob: \"bob_ml1m.csv\"}\n",
    "csv_writer_container = {alice: open(fed_csv[alice], \"w\"), bob: open(fed_csv[bob], \"w\")}\n",
    "part_columns = {\n",
    "    alice: [\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n",
    "    bob: [\"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\"],\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b10032b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "for device, writer in csv_writer_container.items():\n",
    "    writer.write(\"ID,\" + \",\".join(part_columns[device]) + \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fd6241f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "f = open(\"ml-1m/ratings.dat\", \"r\", encoding=\"unicode_escape\")\n",
    "\n",
    "users_data = load_data(\n",
    "    \"./ml-1m/users.dat\",\n",
    "    columns=[\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n",
    ")\n",
    "movies_data = load_data(\"./ml-1m/movies.dat\", columns=[\"MovieID\", \"Title\", \"Genres\"])\n",
    "ratings_columns = [\"UserID\", \"MovieID\", \"Rating\", \"Timestamp\"]\n",
    "\n",
    "\n",
    "# Turn one joined rating record into a CSV line holding only the given columns\n",
    "def _parse_example(feature, columns, index):\n",
    "    if \"Title\" in feature.keys():\n",
    "        feature[\"Title\"] = feature[\"Title\"].replace(\",\", \"_\")\n",
    "    if \"Genres\" in feature.keys():\n",
    "        feature[\"Genres\"] = feature[\"Genres\"].replace(\"|\", \" \")\n",
    "    values = []\n",
    "    values.append(str(index))\n",
    "    for c in columns:\n",
    "        values.append(feature[c])\n",
    "    return \",\".join(values)\n",
    "\n",
    "\n",
    "index = 0\n",
    "# Keep only the first num_sample ratings; a value <= 0 keeps all of them\n",
    "num_sample = 1000\n",
    "for line in f:\n",
    "    ls = line.strip().split(\"::\")\n",
    "    # Join the rating with the user and movie features it refers to\n",
    "    rating = dict(zip(ratings_columns, ls))\n",
    "    rating.update(users_data.get(ls[0]))\n",
    "    rating.update(movies_data.get(ls[1]))\n",
    "    for device, columns in part_columns.items():\n",
    "        parse_f = _parse_example(rating, columns, index)\n",
    "        csv_writer_container[device].write(parse_f + \"\\n\")\n",
    "    index += 1\n",
    "    if num_sample > 0 and index >= num_sample:\n",
    "        break\n",
    "for w in csv_writer_container.values():\n",
    "    w.close()\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fc4f07d",
   "metadata": {},
   "source": [
    "#### So far we have completed the data processing and splitting\n",
    "This produces one file per party:     \n",
    "```    \n",
    "alice: alice_ml1m.csv    \n",
    "bob: bob_ml1m.csv     \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8ca24698",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID,UserID,Gender,Age,Occupation,Zip-code\r\n",
      "0,1,F,1,10,48067\r\n",
      "1,1,F,1,10,48067\r\n",
      "2,1,F,1,10,48067\r\n",
      "3,1,F,1,10,48067\r\n",
      "4,1,F,1,10,48067\r\n",
      "5,1,F,1,10,48067\r\n",
      "6,1,F,1,10,48067\r\n",
      "7,1,F,1,10,48067\r\n",
      "8,1,F,1,10,48067\r\n"
     ]
    }
   ],
   "source": [
    "! head alice_ml1m.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "5f7602d7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID,MovieID,Rating,Title,Genres,Timestamp\r\n",
      "0,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama,978300760\r\n",
      "1,661,3,James and the Giant Peach (1996),Animation Children's Musical,978302109\r\n",
      "2,914,3,My Fair Lady (1964),Musical Romance,978301968\r\n",
      "3,3408,4,Erin Brockovich (2000),Drama,978300275\r\n",
      "4,2355,5,Bug's Life_ A (1998),Animation Children's Comedy,978824291\r\n",
      "5,1197,3,Princess Bride_ The (1987),Action Adventure Comedy Romance,978302268\r\n",
      "6,1287,5,Ben-Hur (1959),Action Adventure Drama,978302039\r\n",
      "7,2804,5,Christmas Story_ A (1983),Comedy Drama,978300719\r\n",
      "8,594,4,Snow White and the Seven Dwarfs (1937),Animation Children's Musical,978302268\r\n"
     ]
    }
   ],
   "source": [
    "! head bob_ml1m.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f313c7b",
   "metadata": {},
   "source": [
    "## Build data_builder_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "6f8cc8cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# alice\n",
    "def create_dataset_builder_alice(\n",
    "    batch_size=128,\n",
    "    repeat_count=5,\n",
    "):\n",
    "    def dataset_builder(x):\n",
    "        import pandas as pd\n",
    "        import tensorflow as tf\n",
    "\n",
    "        x = [dict(t) if isinstance(t, pd.DataFrame) else t for t in x]\n",
    "        x = x[0] if len(x) == 1 else tuple(x)\n",
    "        data_set = (\n",
    "            tf.data.Dataset.from_tensor_slices(x).batch(batch_size).repeat(repeat_count)\n",
    "        )\n",
    "\n",
    "        return data_set\n",
    "\n",
    "    return dataset_builder\n",
    "\n",
    "\n",
    "# bob\n",
    "def create_dataset_builder_bob(\n",
    "    batch_size=128,\n",
    "    repeat_count=5,\n",
    "):\n",
    "    def _parse_bob(row_sample, label):\n",
    "        import tensorflow as tf\n",
    "\n",
    "        y_t = label[\"Rating\"]\n",
    "        y = tf.expand_dims(\n",
    "            tf.where(\n",
    "                y_t > 3,\n",
    "                tf.ones_like(y_t, dtype=tf.float32),\n",
    "                tf.zeros_like(y_t, dtype=tf.float32),\n",
    "            ),\n",
    "            axis=1,\n",
    "        )\n",
    "        return row_sample, y\n",
    "\n",
    "    def dataset_builder(x):\n",
    "        import pandas as pd\n",
    "        import tensorflow as tf\n",
    "\n",
    "        x = [dict(t) if isinstance(t, pd.DataFrame) else t for t in x]\n",
    "        x = x[0] if len(x) == 1 else tuple(x)\n",
    "        data_set = (\n",
    "            tf.data.Dataset.from_tensor_slices(x).batch(batch_size).repeat(repeat_count)\n",
    "        )\n",
    "\n",
    "        data_set = data_set.map(_parse_bob)\n",
    "\n",
    "        return data_set\n",
    "\n",
    "    return dataset_builder\n",
    "\n",
    "\n",
    "data_builder_dict = {\n",
    "    alice: create_dataset_builder_alice(\n",
    "        batch_size=128,\n",
    "        repeat_count=5,\n",
    "    ),\n",
    "    bob: create_dataset_builder_bob(\n",
    "        batch_size=128,\n",
    "        repeat_count=5,\n",
    "    ),\n",
    "}"
   ]
  },
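  {
   "cell_type": "markdown",
   "id": "d7a3e5b8",
   "metadata": {},
   "source": [
    "The label mapping in `_parse_bob` turns ratings above 3 into positive samples. As a rough illustration, the sketch below reproduces that mapping with NumPy as a stand-in for the TensorFlow ops (an assumption made so the snippet runs without TensorFlow). The ratings are the first five values from the `bob_ml1m.csv` preview above.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "ratings = np.array([5, 3, 3, 4, 5])  # first ratings from the preview above\n",
    "\n",
    "# Ratings above 3 become 1.0, all others 0.0, with a trailing axis of size 1\n",
    "labels = np.where(ratings > 3, 1.0, 0.0).astype(np.float32)[:, None]\n",
    "print(labels.ravel(), labels.shape)\n",
    "```\n",
    "\n",
    "The trailing axis mirrors the `tf.expand_dims(..., axis=1)` call, so each batch of labels has shape `(batch, 1)`."
   ]
  },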
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73144726",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the packaged DeepFMbase and DeepFMfuse to define the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e113a8f5",
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from secretflow_fl.ml.nn.applications.sl_deep_fm import DeepFMbase, DeepFMfuse\n",
    "from secretflow_fl.ml.nn import SLModel\n",
    "\n",
    "NUM_USERS = 6040\n",
    "NUM_MOVIES = 3952\n",
    "GENDER_VOCAB = [\"F\", \"M\"]\n",
    "AGE_VOCAB = [1, 18, 25, 35, 45, 50, 56]\n",
    "OCCUPATION_VOCAB = [i for i in range(21)]\n",
    "GENRES_VOCAB = [\n",
    "    \"Action\",\n",
    "    \"Adventure\",\n",
    "    \"Animation\",\n",
    "    \"Children's\",\n",
    "    \"Comedy\",\n",
    "    \"Crime\",\n",
    "    \"Documentary\",\n",
    "    \"Drama\",\n",
    "    \"Fantasy\",\n",
    "    \"Film-Noir\",\n",
    "    \"Horror\",\n",
    "    \"Musical\",\n",
    "    \"Mystery\",\n",
    "    \"Romance\",\n",
    "    \"Sci-Fi\",\n",
    "    \"Thriller\",\n",
    "    \"War\",\n",
    "    \"Western\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7501b19d",
   "metadata": {},
   "source": [
    "### Define Basenet    \n",
    "`DeepFMbase` has 4 parameters:       \n",
    "- dnn_units_size: a list that defines the DNN part. For example, `[256, 32]` means two hidden layers with 256 and 32 units respectively    \n",
    "- dnn_activation: activation function of the DNN, e.g. `relu`      \n",
    "- preprocess_layer: a Keras model that preprocesses the input and outputs a dict of preprocessed data      \n",
    "- fm_embedding_dim: dimension of the FM embedding vectors, 16 by default       \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9a055624",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define alice's basenet\n",
    "def create_base_model_alice():\n",
    "    # Create model\n",
    "    def create_model():\n",
    "        import tensorflow as tf\n",
    "\n",
    "        def preprocess():\n",
    "            inputs = {\n",
    "                \"UserID\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "                \"Gender\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "                \"Age\": tf.keras.Input(shape=(1,), dtype=tf.int64),\n",
    "                \"Occupation\": tf.keras.Input(shape=(1,), dtype=tf.int64),\n",
    "            }\n",
    "            user_id_output = tf.keras.layers.Hashing(\n",
    "                num_bins=NUM_USERS, output_mode=\"one_hot\"\n",
    "            )\n",
    "            user_gender_output = tf.keras.layers.StringLookup(\n",
    "                vocabulary=GENDER_VOCAB, output_mode=\"one_hot\"\n",
    "            )\n",
    "\n",
    "            user_age_out = tf.keras.layers.IntegerLookup(\n",
    "                vocabulary=AGE_VOCAB, output_mode=\"one_hot\"\n",
    "            )\n",
    "            user_occupation_out = tf.keras.layers.IntegerLookup(\n",
    "                vocabulary=OCCUPATION_VOCAB, output_mode=\"one_hot\"\n",
    "            )\n",
    "\n",
    "            outputs = {\n",
    "                \"UserID\": user_id_output(inputs[\"UserID\"]),\n",
    "                \"Gender\": user_gender_output(inputs[\"Gender\"]),\n",
    "                \"Age\": user_age_out(inputs[\"Age\"]),\n",
    "                \"Occupation\": user_occupation_out(inputs[\"Occupation\"]),\n",
    "            }\n",
    "            return tf.keras.Model(inputs=inputs, outputs=outputs)\n",
    "\n",
    "        preprocess_layer = preprocess()\n",
    "        model = DeepFMbase(\n",
    "            dnn_units_size=[256, 32],\n",
    "            preprocess_layer=preprocess_layer,\n",
    "        )\n",
    "        model.compile(\n",
    "            loss=tf.keras.losses.binary_crossentropy,\n",
    "            optimizer=tf.keras.optimizers.Adam(),\n",
    "            metrics=[\n",
    "                tf.keras.metrics.AUC(),\n",
    "                tf.keras.metrics.Precision(),\n",
    "                tf.keras.metrics.Recall(),\n",
    "            ],\n",
    "        )\n",
    "        return model  # need wrap\n",
    "\n",
    "    return create_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "36b21085",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define bob's basenet\n",
    "def create_base_model_bob():\n",
    "    # Create model\n",
    "    def create_model():\n",
    "        import tensorflow as tf\n",
    "\n",
    "        # define preprocess layer\n",
    "        def preprocess():\n",
    "            inputs = {\n",
    "                \"MovieID\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "                \"Genres\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "            }\n",
    "\n",
    "            movie_id_out = tf.keras.layers.Hashing(\n",
    "                num_bins=NUM_MOVIES, output_mode=\"one_hot\"\n",
    "            )\n",
    "            movie_genres_out = tf.keras.layers.TextVectorization(\n",
    "                output_mode='multi_hot', split=\"whitespace\", vocabulary=GENRES_VOCAB\n",
    "            )\n",
    "            outputs = {\n",
    "                \"MovieID\": movie_id_out(inputs[\"MovieID\"]),\n",
    "                \"Genres\": movie_genres_out(inputs[\"Genres\"]),\n",
    "            }\n",
    "            return tf.keras.Model(inputs=inputs, outputs=outputs)\n",
    "\n",
    "        preprocess_layer = preprocess()\n",
    "\n",
    "        model = DeepFMbase(\n",
    "            dnn_units_size=[256, 32],\n",
    "            preprocess_layer=preprocess_layer,\n",
    "        )\n",
    "        model.compile(\n",
    "            loss=tf.keras.losses.binary_crossentropy,\n",
    "            optimizer=tf.keras.optimizers.Adam(),\n",
    "            metrics=[\n",
    "                tf.keras.metrics.AUC(),\n",
    "                tf.keras.metrics.Precision(),\n",
    "                tf.keras.metrics.Recall(),\n",
    "            ],\n",
    "        )\n",
    "        return model  # need wrap\n",
    "\n",
    "    return create_model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "236461c6",
   "metadata": {},
   "source": [
    "### Define Fusenet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "4e5f866a",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_fuse_model():\n",
    "    # Create model\n",
    "    def create_model():\n",
    "        import tensorflow as tf\n",
    "\n",
    "        model = DeepFMfuse(dnn_units_size=[256, 256, 32])\n",
    "        model.compile(\n",
    "            loss=tf.keras.losses.binary_crossentropy,\n",
    "            optimizer=tf.keras.optimizers.Adam(),\n",
    "            metrics=[\n",
    "                tf.keras.metrics.AUC(),\n",
    "                tf.keras.metrics.Precision(),\n",
    "                tf.keras.metrics.Recall(),\n",
    "            ],\n",
    "        )\n",
    "        return model\n",
    "\n",
    "    return create_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "e4fcd9e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "base_model_dict = {alice: create_base_model_alice(), bob: create_base_model_bob()}\n",
    "model_fuse = create_fuse_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6259e171",
   "metadata": {},
   "source": [
    "## Run it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ab0287fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Create proxy actor <class 'secretflow_fl.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party alice.\n",
      "INFO:root:Create proxy actor <class 'secretflow_fl.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party bob.\n",
      "INFO:root:SL Train Params: {'self': <secretflow_fl.ml.nn.sl.sl_model.SLModel object at 0x7f993cf22760>, 'x': VDataFrame(partitions={PYURuntime(alice): Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f993cf2e7f0>), PYURuntime(bob): Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f993ced0250>)}, aligned=True), 'y': VDataFrame(partitions={PYURuntime(bob): Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f993cf3d190>)}, aligned=True), 'batch_size': 128, 'epochs': 5, 'verbose': 1, 'callbacks': None, 'validation_data': None, 'shuffle': False, 'sample_weight': None, 'validation_freq': 1, 'dp_spent_step_freq': None, 'dataset_builder': {PYURuntime(alice): <function create_dataset_builder_alice.<locals>.dataset_builder at 0x7f9a2074f550>, PYURuntime(bob): <function create_dataset_builder_bob.<locals>.dataset_builder at 0x7f9a2074f940>}, 'audit_log_params': {}, 'random_seed': 1234, 'audit_log_dir': None}\n",
      "  0%|          | 0/8 [00:00<?, ?it/s]2023-07-25 11:26:40.221001: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n",
      "100%|██████████| 8/8 [00:08<00:00,  1.02s/it, epoch: 1/5 -  train_loss:0.7806562781333923  train_auc_1:0.4967503547668457  train_precision_1:0.628742516040802  train_recall_1:0.16746412217617035 ]\n",
      "100%|██████████| 8/8 [00:00<00:00, 15.96it/s, epoch: 2/5 -  train_loss:0.6794981956481934  train_auc_1:0.5373517870903015  train_precision_1:0.6171548366546631  train_recall_1:0.9409888386726379 ]\n",
      "100%|██████████| 8/8 [00:00<00:00, 14.90it/s, epoch: 3/5 -  train_loss:0.6782025694847107  train_auc_1:0.5984559655189514  train_precision_1:0.6269999742507935  train_recall_1:1.0 ]\n",
      "100%|██████████| 8/8 [00:00<00:00, 15.26it/s, epoch: 4/5 -  train_loss:0.6483343243598938  train_auc_1:0.6253233551979065  train_precision_1:0.7081005573272705  train_recall_1:0.8086124658584595 ]\n",
      "100%|██████████| 8/8 [00:00<00:00, 15.86it/s, epoch: 5/5 -  train_loss:0.6321477293968201  train_auc_1:0.6638531684875488  train_precision_1:0.6795511245727539  train_recall_1:0.8692185282707214 ]\n"
     ]
    }
   ],
   "source": [
    "from secretflow.data.vertical import read_csv as v_read_csv\n",
    "\n",
    "vdf = v_read_csv(\n",
    "    {alice: \"alice_ml1m.csv\", bob: \"bob_ml1m.csv\"}, keys=\"ID\", drop_keys=\"ID\"\n",
    ")\n",
    "label = vdf[\"Rating\"]\n",
    "\n",
    "data = vdf.drop(columns=[\"Rating\", \"Timestamp\", \"Title\", \"Zip-code\"])\n",
    "data[\"UserID\"] = data[\"UserID\"].astype(\"string\")\n",
    "data[\"MovieID\"] = data[\"MovieID\"].astype(\"string\")\n",
    "\n",
    "sl_model = SLModel(\n",
    "    base_model_dict=base_model_dict,\n",
    "    device_y=bob,\n",
    "    model_fuse=model_fuse,\n",
    ")\n",
    "history = sl_model.fit(\n",
    "    data,\n",
    "    label,\n",
    "    epochs=5,\n",
    "    batch_size=128,\n",
    "    random_seed=1234,\n",
    "    dataset_builder=data_builder_dict,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dcce654",
   "metadata": {},
   "source": [
    "At this point, we have trained a recommendation model on the **MovieLens** dataset using the DeepFM encapsulation provided by SecretFlow."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92dc699c",
   "metadata": {},
   "source": [
    "## Summary\n",
    "This tutorial demonstrates how to use DeepFM in SecretFlow through a recommendation task on the **MovieLens** dataset.    \n",
    "You need to:     \n",
    "1. Download and split the dataset      \n",
    "2. Define the dataloader for data processing         \n",
    "3. Define the preprocessing layer for data preprocessing, define the DNN structure, and call `DeepFMbase` and `DeepFMfuse` to define the model        \n",
    "4. Use `SLModel` for training, prediction, and evaluation      \n",
    "\n",
    "You can try it on your own dataset. If you have any questions, feel free to discuss them on GitHub.      "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
